8 research outputs found

    Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech

    Full text link
    Automatic detection and severity level classification of dysarthria directly from acoustic speech signals can be used as a tool in medical diagnosis. In this work, the pre-trained wav2vec 2.0 model is studied as a feature extractor for building detection and severity level classification systems for dysarthric speech. The experiments were carried out with the widely used UA-Speech database. In the detection experiments, the best performance was obtained using the embeddings from the first layer of the wav2vec 2.0 model, which yielded an absolute improvement of 1.23% in accuracy compared to the best-performing baseline feature (spectrogram). In the severity level classification task, the embeddings from the final layer gave an absolute improvement of 10.62% in accuracy compared to the best-performing baseline feature (mel-frequency cepstral coefficients).
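
    A minimal sketch of the layer-wise feature extraction described above, using the wav2vec 2.0 model from the HuggingFace transformers library; the checkpoint and the mean pooling over time are assumptions for illustration, not details taken from the paper:

        # Sketch: utterance embedding from a chosen wav2vec 2.0 transformer layer.
        # "facebook/wav2vec2-base" and mean pooling are assumptions; the paper's
        # exact checkpoint, layer indexing, and pooling may differ.
        import numpy as np
        import torch
        from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

        extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
        model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

        def layer_embedding(waveform, sr=16000, layer=1):
            """Return a fixed-size utterance embedding from one transformer layer."""
            inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
            with torch.no_grad():
                out = model(inputs.input_values, output_hidden_states=True)
            # hidden_states[0] is the convolutional encoder output;
            # hidden_states[1] is the first transformer layer, and so on.
            return out.hidden_states[layer].mean(dim=1).squeeze(0)  # (768,)

        emb = layer_embedding(np.random.randn(16000).astype(np.float32))  # dummy 1 s input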

    Generating speech in different speaking styles using WaveNet

    No full text
    Generating speech in different styles from any given style is a challenging research problem in speech technology. This topic has many applications, for example, in assistive devices and in human-computer speech interaction. With recent developments in neural networks, speech generation has achieved a great level of naturalness and flexibility. The WaveNet model, one of the main drivers of recent progress in text-to-speech synthesis, is an advanced neural network model that can be used in different speech generation systems. WaveNet uses a sequential generation process in which a new sample predicted by the model is fed back into the network as input to predict the next sample, until the entire waveform is generated. This thesis studies training the WaveNet model with speech spoken in a particular source style and generating speech waveforms in a given target style. The source style studied in the thesis is normal speech and the target style is Lombard speech. The latter corresponds to the speaking style elicited by the Lombard effect, that is, the phenomenon in human speech communication in which speakers change their speaking style in noisy environments in order to raise loudness and make the spoken message more intelligible. WaveNet was trained by conditioning the model on acoustic mel-spectrogram features of the input speech. Four different databases were used for training the model. Two of these databases (Nick 1, Nick 2) were originally collected at the University of Edinburgh in the UK and the other two (CMU Arctic 1, CMU Arctic 2) at Carnegie Mellon University in the US. The databases consisted of different mixtures of speaking styles and varied in the number of unique speakers. Two subjective listening tests (a speaking style similarity test and a MOS test on speech quality and naturalness) were conducted to assess the performance of WaveNet for each database. In the former test, the WaveNet-generated speech waveforms and the natural Lombard reference were compared in terms of their style similarity. In the latter test, the quality and naturalness of the WaveNet-generated speech signals were evaluated. In the speaking style similarity test, training with Nick 2 yielded slightly better performance than the other three databases. In the quality and naturalness tests, we found that when training was done using CMU Arctic 2, the quality of the Lombard speech signals was better than when using the other three databases. Overall, the study shows that a WaveNet model trained on speech of the source speaking style (normal) is not capable of generating speech waveforms of the target style (Lombard) unless some speech signals of the target style are included in the training data (i.e., Nick 2 in this study).
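
    The sample-by-sample generation loop described above can be sketched as follows; the model here is a hypothetical module emitting a categorical distribution over 8-bit mu-law levels, conditioned on a mel-spectrogram, and is not the thesis implementation:

        # Sketch of WaveNet's autoregressive sampling: each predicted sample is
        # fed back as input to predict the next, until the waveform is complete.
        # "model" is a hypothetical module: model(samples, mel) -> (1, T, 256).
        import torch

        def generate(model, mel, num_samples):
            samples = torch.zeros(1, 1, dtype=torch.long)   # initial (silent) sample
            for _ in range(num_samples):
                logits = model(samples, mel)                # (1, T, 256) mu-law logits
                probs = torch.softmax(logits[:, -1], dim=-1)
                nxt = torch.multinomial(probs, 1)           # draw the next sample
                samples = torch.cat([samples, nxt], dim=1)  # feed it back as input
            return samples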

    Convolutional Neural Networks for Classification of Voice Qualities from Speech and Neck Surface Accelerometer Signals

    No full text
    Prior studies in the automatic classification of voice quality have mainly studied support vector machine (SVM) classifiers using the acoustic speech signal as input. Recently, one voice quality classification study was published using neck surface accelerometer (NSA) and speech signals as inputs and using SVMs with hand-crafted glottal source features. The present study examines simultaneously recorded NSA and speech signals in the classification of three voice qualities (breathy, modal, and pressed) using convolutional neural networks (CNNs) as classifiers. The study has two goals: (1) to investigate which of the two signals (NSA vs. speech) is more useful in the classification task, and (2) to compare whether deep learning-based CNN classifiers with spectrogram and mel-spectrogram features can improve the classification accuracy compared to SVM classifiers using hand-crafted glottal source features. The results indicated that the NSA signal gave better classification of the voice qualities than the speech signal, and that the CNN classifier outperformed the SVM classifiers by large margins. The best mean classification accuracy was achieved with the mel-spectrogram as input to the CNN classifier (93.8% for NSA and 90.6% for speech).
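
    A sketch of the mel-spectrogram/CNN pipeline the study describes; the mel parameters and layer sizes are illustrative assumptions, not the paper's configuration:

        # Sketch: mel-spectrogram input to a small CNN for 3-way voice quality
        # classification (breathy / modal / pressed). Architecture is illustrative.
        import torch
        import torch.nn as nn
        import torchaudio

        melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)

        cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, 3),   # three voice quality classes
        )

        wave = torch.randn(1, 16000)              # 1 s dummy speech or NSA signal
        logits = cnn(melspec(wave).unsqueeze(1))  # (1, 3) class scores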

    A Comparison of Data Augmentation Methods in Voice Pathology Detection

    No full text
    To distinguish pathological voices from healthy voices, automatic voice pathology detection systems can be built using machine learning (ML) and deep learning (DL) techniques. To fully exploit such systems, large quantities of training data are typically required. The amount of training data is, however, small in the area of pathological voice, and therefore data augmentation (DA) becomes a potential technology to artificially increase the quantity of training data. This study presents a systematic comparison between various DA methods in the detection of pathological voice, including three time-domain methods (noise addition, pitch shifting and time stretching), one time-frequency domain method (SpecAugment), and two vocoder-based methods (harmonic-to-noise ratio (HNR) modification and glottal pulse length modification). Detection systems were built using four popular spectral feature representations (static mel-frequency cepstral coefficients (MFCCs), dynamic MFCCs, spectrogram and mel-spectrogram). As classifiers, two widely used ML models (support vector machine (SVM) and random forest (RF)) and two DL models (long short-term memory (LSTM) network and convolutional neural network (CNN) with 1-dimensional (1-D) and 2-dimensional (2-D) architectures) were used. These systems were trained using a small number of training samples from two popular databases of pathological voice (HUPA and SVD) to find the best feature/classifier combination for each database. As a result, one ML-based detection system (mel-spectrogram/SVM for HUPA and SVD) and two DL-based detection systems (dynamic MFCCs/2-D CNN for HUPA and mel-spectrogram/2-D CNN for SVD) were selected for the comparison of the DA methods. The results show that using DA in system training increased detection accuracy compared to the baseline systems trained without DA. This improvement in accuracy was, however, clearly larger for the 2-D CNN system than for the SVM system. Furthermore, all six DA methods improved the accuracy of the 2-D CNN system compared to the baseline system for both databases. The highest improvements were achieved with the time-frequency domain SpecAugment method, which improved accuracy by 1.5% and 3.8% (absolute) for the HUPA and SVD databases, respectively.
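
    The time-domain methods and SpecAugment compared above map onto standard library calls; a sketch with illustrative augmentation strengths (not the values tuned in the paper):

        # Sketch of three time-domain DA methods and SpecAugment-style masking.
        import numpy as np
        import librosa
        import torch
        import torchaudio

        y = np.random.randn(16000).astype(np.float32)  # stand-in voice signal
        sr = 16000

        noisy = y + 0.005 * np.random.randn(len(y)).astype(np.float32)  # noise addition
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)      # pitch shifting
        stretched = librosa.effects.time_stretch(y, rate=1.1)           # time stretching

        # SpecAugment: mask random frequency bands and time frames of the spectrogram.
        spec = torchaudio.transforms.MelSpectrogram(sample_rate=sr)(torch.tensor(y))
        spec = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)(spec)
        spec = torchaudio.transforms.TimeMasking(time_mask_param=35)(spec)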

    Investigation of Self-supervised Pre-trained Models for Classification of Voice Quality from Speech and Neck Surface Accelerometer Signals

    No full text
    Prior studies in the automatic classification of voice quality have mainly studied the use of the acoustic speech signal as input. Recently, a few studies have been carried out by jointly using both speech and neck surface accelerometer (NSA) signals as inputs, and by extracting mel-frequency cepstral coefficients (MFCCs) and glottal source features. This study examines simultaneously recorded speech and NSA signals in the classification of voice quality (breathy, modal, and pressed) using features derived from three self-supervised pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) and using a support vector machine (SVM) as well as convolutional neural networks (CNNs) as classifiers. Furthermore, the effectiveness of the pre-trained models in feature extraction is compared between glottal source waveforms and raw signal waveforms for both speech and NSA inputs. Using two signal processing methods (quasi-closed phase (QCP) glottal inverse filtering and zero frequency filtering (ZFF)), glottal source waveforms are estimated from both speech and NSA signals. The study has three main goals: (1) to study whether features derived from pre-trained models improve classification accuracy compared to conventional features (spectrogram, mel-spectrogram, MFCCs, i-vector, and x-vector), (2) to investigate which of the two modalities (speech vs. NSA) is more effective as input in the classification task with pre-trained model-based features, and (3) to evaluate whether the deep learning-based CNN classifier can enhance the classification accuracy in comparison to the SVM classifier. The results revealed that the NSA input showed better classification performance than the speech signal. Between the features, the pre-trained model-based features showed better classification accuracies than the conventional features, both for speech and NSA inputs. The two classifiers performed equally well for all the pre-trained model-based features for both speech and NSA signals. It was also found that the HuBERT features performed better than the wav2vec2-BASE and wav2vec2-LARGE features for both speech and NSA inputs. In particular, compared to the conventional features, the HuBERT features showed an absolute accuracy improvement of 3%–6% for speech and NSA signals in the classification of voice quality.
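
    A sketch of utterance-level HuBERT feature extraction feeding an SVM, one of the feature/classifier combinations above; the checkpoint and mean pooling over time are assumptions for illustration:

        # Sketch: HuBERT features for an SVM voice quality classifier.
        # "facebook/hubert-base-ls960" and mean pooling are assumptions.
        import torch
        from transformers import HubertModel
        from sklearn.svm import SVC

        hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

        def hubert_features(waveforms):
            """waveforms: list of 16 kHz mono tensors -> (N, 768) feature matrix."""
            feats = []
            with torch.no_grad():
                for w in waveforms:
                    h = hubert(w.unsqueeze(0)).last_hidden_state  # (1, T, 768)
                    feats.append(h.mean(dim=1).squeeze(0))
            return torch.stack(feats).numpy()

        # Usage: SVC(kernel="rbf").fit(hubert_features(train_waves), train_labels)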

    Comparing 1-dimensional and 2-dimensional spectral feature representations in voice pathology detection using machine learning and deep learning classifiers

    No full text
    The present study investigates the use of 1-dimensional (1-D) and 2-dimensional (2-D) spectral feature representations in voice pathology detection with several classical machine learning (ML) and recent deep learning (DL) classifiers. Four popularly used spectral feature representations (static mel-frequency cepstral coefficients (MFCCs), dynamic MFCCs, spectrogram and mel-spectrogram) are derived in both 1-D and 2-D form from voice signals. Three widely used ML classifiers (support vector machine (SVM), random forest (RF) and AdaBoost) and three DL classifiers (deep neural network (DNN), long short-term memory (LSTM) network, and convolutional neural network (CNN)) are used with the 1-D feature representations. In addition, CNN classifiers are built using the 2-D feature representations. The widely used HUPA database is considered in the pathology detection experiments. Experimental results revealed that using the CNN classifier with the 2-D feature representations yielded better accuracy compared to using the ML and DL classifiers with the 1-D feature representations. The best performance was achieved using the 2-D CNN classifier based on dynamic MFCCs, which showed a detection accuracy of 81%.
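
    The 1-D versus 2-D distinction above amounts to whether the time axis is kept or averaged away; a sketch with librosa, using illustrative frame settings:

        # Sketch: the same dynamic MFCC features in 2-D form (time-frequency
        # matrix, for a CNN) and 1-D form (per-utterance vector, for SVM/RF/DNN).
        import numpy as np
        import librosa

        y = np.random.randn(16000).astype(np.float32)  # stand-in voice signal
        sr = 16000

        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # static MFCCs
        dyn = np.vstack([mfcc,
                         librosa.feature.delta(mfcc),             # delta
                         librosa.feature.delta(mfcc, order=2)])   # delta-delta

        feat_2d = dyn               # (39, frames): 2-D input to a CNN
        feat_1d = dyn.mean(axis=1)  # (39,): 1-D input to classical classifiers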

    Gas hydrate phase equilibrium in porous media: mathematical modeling and correlation

    No full text
    In this paper, we present two different approaches to represent/predict the gas hydrate phase equilibria for the carbon dioxide, methane, or ethane + pure water system in the presence of various types of porous media with different pore sizes. The studied porous media include silica gel, mesoporous silica, and porous silica glass. First, a correlation is presented that estimates the hydrate suppression temperature due to pore effects from the ice point depression (IPD). Second, several mathematical models are proposed using the least squares support vector machine (LSSVM) algorithm for the determination of the dissociation pressures of the corresponding systems. The results indicate that although the IPD-based correlation yields reliable results for the gas hydrate systems in the presence of porous silica glass media, the developed LSSVM models appear more general due to their predictive capability over all of the investigated systems.
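
    For reference, LSSVM regression reduces training to a single linear system; a minimal sketch with an RBF kernel, where the kernel choice and hyperparameters are illustrative assumptions rather than the models fitted in the paper:

        # Sketch of least squares SVM (LSSVM) regression. The dual problem is
        # the linear system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y].
        import numpy as np

        def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
            sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
            K = np.exp(-sq / (2 * sigma ** 2))        # RBF kernel matrix
            n = len(y)
            A = np.block([[np.zeros((1, 1)), np.ones((1, n))],
                          [np.ones((n, 1)), K + np.eye(n) / gamma]])
            sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
            return sol[0], sol[1:]                    # bias b, dual weights alpha

        def lssvm_predict(X_train, alpha, b, x, sigma=1.0):
            k = np.exp(-((X_train - x) ** 2).sum(-1) / (2 * sigma ** 2))
            return k @ alpha + b  # e.g. predicted dissociation pressure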
